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Abstract 

/3-barrel membrane proteins are found in the outer membrane of gram-negative bacteria, mitochon- 
dria, and chloroplasts. Little is known about how residues in membrane /3-barrels interact preferentially 
with other residues on adjacent strands. We have developed probabilistic models to quantify propensities 
of residues for different spatial locations and for interstrand pairwise contact interactions involving strong 
H-bonds, side-chain interactions, and weak H-bonds. Using the reference state of exhaustive permutation 
of residues within the same /3-strand, the propensity values and p- values measuring statistical significance 
are calculated exactly by analytical formulae we have developed. Our findings show that there are char- 
acteristic preferences of residues for different membrane locations. Contrary to the "positive-inside" rule 
for helical membrane proteins, /3-barrel membrane proteins follow a significant albeit weaker "positive- 
outside" rule, in that the basic residues Arg and Lys are disproportionately favored in the extracellular 
cap region and disfavored in the periplasmic cap region. We find that different residue pairs prefer strong 
backbone H-bonded interstrand pairings (e.g. Gly- Aromatic) or non-H-bonded pairings {e.g. Aromatic- 
Aromatic). In addition, we find that Tyr and Phe participate in aromatic rescue by shielding Gly from 
polar environments. We also show that these propensities can be used to predict the registration of strand 
pairs, an important task for the structure prediction of /3-barrel membrane proteins. Our accuracy of 
44% is considerably better than random (7%) . It also significantly outperforms a comparable registration 
prediction for soluble /3-sheets under similar conditions. Our results imply several experiments that can 
help to elucidate the mechanisms of in vitro and in vivo folding of /3-barrel membrane proteins. The 
propensity scales developed in this study will also be useful for computational structure prediction and 
for folding simulations. 

The most interesting part about the computational models is contained in the supplementary infor- 
mation after the bibliography towards the end of the paper. 
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1 Introduction 



Integral membrane proteins can be categorized into two structural classes: a-helical proteins and /3-barrel 
proteins. /3-barrel proteins are found in the outer membrane of gram-negative bacteria, mitochondria, 
and chloroplasts [1-3] . It is estimated that membrane /3-barrels constitute about 2-3% of all proteins in 
the genomes of gram-negative bacteria [3] . Many bacterial pore- forming exotoxins such as a- hemolysin of 
Staphylococcus aureus and protective antigen of Bacillus anthracis are also /3-barrel membrane proteins 
[4,5]. /3-barrel proteins have diverse biological functions, e.g. pore formation, membrane anchoring, 
enzyme activity, and bacterial virulence [6]. Their medical relevance is immediate, as membrane proteins 
in bacteria provide candidate molecular targets for the development of antimicrobial drugs and vaccines. 

With recent rapid progress in the development of computational tools, /3-barrel membrane proteins 
can be reliably identified from sequences [7-9]. However, there are less than 30 structures currently 
known at atomic details. Based on knowledge of these structures, /3-barrel membrane proteins are known 
to contain an even number of strands (8 to 22). These strands are arranged in an antiparallel /3-sheet 
that is twisted and tilted into a barrel structure, usually with short /3-turn-like periplasmic loops and 
long extracellular loops [10]. A /3-barrel protein may exist either as a monomer or as an oligomer. A 
set of simple rules has been compiled by Schulz to describe the structural features of transmembrane 
/3-barrels [10,11]. 

Compared to helical membrane proteins, the nature of the assembly of /3-barrels is likely to be 
different, as there are many more polar and ionizable residues intervening in the transmembrane (TM) 
region. Indeed, several studies have already identified significant differences in folding between /3-barrel 
and a-helical membrane proteins [1,2, 12]. This complexity makes the prediction of the TM strands of 
/3-barrel proteins from sequence more difficult than for helical proteins, as simple rules based on stretches 
of hydrophobicity are no longer effective [7, 9, 13]. Furthermore, because of the low sequence identity for 
TM strands in /3-barrel membrane proteins, comparative modeling cannot generate reliable structures. 

Despite these difficulties, recent studies have identified characteristic amino acid preferences for dif- 
ferent regions of the /3-barrel [9, 13-15,40]. For example, polar residues are found to be enriched in the 
internal face of the barrel and the solvent exposed regions of the protein extending past the membrane 
bilayer. Large aliphatic residues are enriched in the external face of the barrel, and bulky aromatic 
residues tend to prefer the external face at the headgroup regions, forming "girdles" at the membrane 
interface [10, 14]. Bigelow et al. also found that Tyr prefers the C-terminal over the N-terminal of a TM 
strand [7]. 

A number of important questions remain unanswered. For example, little is known about specific 
interactions between residues on adjacent TM strands. In contrast, several studies have identified various 
preferred two-residue interstrand interactions in soluble /3-sheets [18, 19]. A further challenging question 
is whether the requirements of thermodynamic stability in the folding and in vivo sorting, targeting, and 
translocation of /3-barrel membrane proteins lead to the avoidance of certain types of spatial patterns. 
Finally, we do not know how residues in /3-barrel membrane proteins interact with specific regions of 
lipid bilayers. 

To answer these important questions, we have developed methods to estimate the positional propen- 
sity of residues for different spatial regions of the barrel and to estimate the propensity for interstrand 
pairwise spatial interactions. Our findings show that there are additional, previously unknown charac- 
teristic preferences of residues for different locations. Contrary to the "positive-inside" rule for a-helical 
membrane proteins, /3-barrel proteins follow the "positive-outside" rule, in that the basic residues Arg 
and Lys are disproportionately favored in the extracellular cap region and disfavored in the periplasmic 
cap region. We also find several notable interstrand pairwise motifs. The Gly-Tyr and Gly-Phe mo- 
tifs, found in interstrand pairs sharing backbone H-bonds, reveal the existence of "aromatic rescue" in 
/3-barrel membrane proteins, in which a large aromatic side-chain shields Gly from a polar environment, 
a behavior first discovered in soluble /3-sheets [20]. The propensity scale of interstrand pairwise contact 
interactions we developed (called TransMembrane Strand Interaction Propensity [Tmsip]) is based on 
a physical model of strand interactions involving backbone strong H-bond interactions, non-H-bonded 
side-chain interactions, and weak H-bond interactions. We also show that this propensity scale can be 
used to predict the registration of strand pairs, an important task for the structure prediction of /3-barrel 
membrane proteins. Our accuracy of 44% is better than random (7%), and it significantly outperforms a 
comparable registration prediction for soluble /3-sheets under similar conditions. Our results also suggest 
several experimentally testable hypotheses about the in vitro and in vivo folding of /3-barrel membrane 
proteins. 
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Figure 1: Definitions of boundaries of regions and summary of residue region-specific preferences shown on the structure 
of the 8-strand barrel of OmpA. Regions, defined by distance from the barrel center, are defined as follows: Extracellular 
Cap = 13.5 A to 20.5 A Extracellular Headgroup = 6.5 A to 13.5 A Core = —6.5 A to 6.5 A Periplasmic Headgroup = 
— 13.5 A to —6.5 A Periplasmic Cap = —20.5 A to —13.5 APreferred residues (those with high odds ratios) in each region 
are listed on the left. For reference, one internal residue (Lys 12, on the right) and one external residue (Leu 79, on the 
left) are shown. 



2 Results 

Our study is based on a very small dataset of known structures, comprising only 19 /3-barrel membrane 
proteins. It is a very challenging task to identify significant patterns from so little data. To analyze small 
sample data, our approach is to employ a rigorous statistical model that enables the exact calculation 
of p-values from observed patterns. With the aid of a properly calculated p- value, we can then identify 
patterns that are truly significant, and discard spurious findings. This is critical for spatial pattern 
discovery in /3-barrel membrane proteins, and ours is the first study to employ such rigorous methods. 

2.1 Spatial regional preference of residues 
2.1.1 Spatial regions of TM barrels 

We define eight distinct spatial regions for each TM strand based on the vertical distance along the 
membrane normal and the orientation of the side-chain (inside- facing or outside- facing, Figure 1). After 
placing the origin of the reference frame at the midpoint of the membrane bilayer and taking the vertical 
axis perpendicular to the bilayer as the z-axis, we measure the vertical distance of each residue from 
the barrel center (the xy plane where z = 0) along the z-axis. We follow the definition of Wimley [14] 
and define the core region as all residues whose a-carbons are within 6.5 A of the barrel center. The 
periplasmic and extracellular headgroup regions are similarly defined for the space 6.5-13.5 A away from 
the barrel center on either side of the core region. These two headgroup regions are distinguished based 
on information in the original paper of the protein structure. Usually, both the N- and C-termini of the 
peptide chain as well as the short loop regions are found on the periplasmic side of the /3-barrel. The 
core and two headgroup regions are further divided into internal and external residues, depending on 
whether their side-chains face into or away from the center of the barrel, respectively. The periplasmic 
and extracellular cap regions are defined as the space 13.5-20.5 A away from the barrel center. These 
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cap regions are similar to the membrane- water interface studied by Granseth et al. [22]. Since the cap 
regions are not located in the membrane domain, they are not divided into internal and external regions. 

2.1.2 Region specific single-body propensities 

Single-body propensities can be used to reflect the preference of different types of residues for different 
spatial regions of the transmembrane (TM) barrel. We define the single-body propensity of an amino acid 
residue type as an odds ratio comparing the frequency of this residue type in one region to its frequency 
in all eight regions combined. Results of residue single-body propensities are listed in Table 1, along with 
statistical significance in the form of p-values. These two-tailed p-values represent the probability that 
an odds-ratio will be more extreme than the observed one in the reference model, in which the positions 
of the residues in all proteins in the dataset are equally likely to take any one combination when they 
are exhaustively permuted. 

2.1.3 Location preference of residues 

Similar to previous studies [14] , we find that aromatic residues have a strong preference for the external 
extracellular and periplasmic headgroup regions. Long aliphatic residues are favored in the external- 
facing core regions, and polar residues in all internal-facing regions as well as both cap regions. Cys 
residues do not occur in any TM region or the periplasmic cap region, consistent with earlier findings for 
helical membrane proteins [23]. The favorable propensity of Cys for the extracellular cap in Table 1 is 
due to a single cysteine cross-bridge in LamB (pdb 2mpr) and is not statistically significant. 

We find that Pro prefers to be in the periplasmic cap region, because Pro residues are enriched in 
the short /3-turns found in this region. Phe also prefers this region, despite its generally low propensity 
for /3-turn regions [26]. While the a-carbon of the Phe residue is situated in the periplasmic cap and 
therefore does not contribute to the TM /3-sheet, the aromatic side-chain extends into the TM region 
and contributes to the "aromatic girdle" by "anti-snorkeling." The preference of Phe for the periplasmic 
cap region has been described for both a- helical [25] and /3-barrel membrane proteins [27] . 

Our study provides novel insights about the location preferences of residues, as well as the significance 
of these preferences as measured by p- values (Table 1), which have not been provided in previous studies 
[14, 16]. We discuss several novel findings in more detail below. 

The positive-outside rule: asymmetric distribution of basic residues in the cap re- 
gions. The propensities for basic amino acids are very different between the periplasmic and extra- 
cellular cap regions. Arg and Lys have a lower preference for the periplasmic cap region (0.61 and 0.86 
respectively) than for the extracellular cap regions (1.54 for both), despite the general "rule" that polar 
residues prefer both cap regions. Thus basic residues are more than twice as likely to be found in the 
extracellular cap than in the periplasmic cap. In contrast, acidic residues (Asp and Glu) are prevalent 
and evenly distributed in both cap regions. This pattern is analogous to the "positive-inside" rule for 
a-helical membrane proteins, which states that acidic residues are evenly distributed at both ends of a 
TM helix, but basic residues favor the cis side of the membrane, i.e. the side in which the helix inserts 
[24]. Because /3-barrel membrane proteins insert into the outer membrane from the periplasmic side [12], 
the asymmetric distribution of basic residues described here is the opposite of the positive-inside rule, 
and thus we have named it the "positive-outside" rule. 

2.2 Strand interactions: pairwise spatial motifs and antimotifs 

Single-residue propensities cannot capture information about interactions between residues on different 
strands. Little is known about whether there is any specific preference for residues to interact across 
adjacent strands. This is in contrast to helical membrane proteins, where helical interactions are the 
subject of several studies [23, 28-32]. 

2.2.1 Model of strand interactions: Strong H-bonds, side-chain interactions, and 
weak H-bonds 

Strands in /3-barrel membrane proteins have 7 or more residues with a clear internal face and external 
face, fnterstrand interactions in antiparallel /3-sheets are characterized by a periodic dyad bonding 
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Table 1: Single-body propensities by region with significance level measured by p- values. Odds ratio entries listed 
as "-" indicate that no residues of that type occurred in the region in the dataset. p- value entries listed as "-" are 
not significant at the threshold of 0.05. 

repeat pattern, which consists of units of two residues (Figure 2). We characterize interstrand pairwise 
interactions with three different types of interactions, namely, strong regular hydrogen bonds between 
backbone N and O atoms, "non-H-bonded" side-chain interactions (interactions without strong backbone 
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Figure 2: Repeating pattern of interstrand interactions and alternating internal and external side-chains. Red = strong 
H-bonds, Blue = non-H-bonded interactions, Green = weak H-bonds. Residues i and j in the center constitute one dyad 
(boxed). Residue i interacts in a strong N-0 H-bond with residue y on its right and a non-H-bonded interaction with 
residue b on its left. In contrast, residue j interacts in a strong H-bond with residue c on its left and a non-H-bonded 
interaction with residue z on its right. This alternating dyad pattern repeats throughout each strand. In addition, each 
residue also interacts through weak C Q -0 H-bonds to two residues, one on each side. In this case, i shares a weak H-bond 
with both a and x, and j with both b and y. Thus, residue j interacts through weak H-bonds to the "bridge partners" 
(strong H-bond and non-H-bonded interactions, in this case b and y) of the next residue in the strand, i, in the N-C 
direction (as shown by arrows at the top). 

N-O hydrogen bonds), and weak C a -0 hydrogen bonds between the C a atom of one residue and a 
backbone oxygen on another strand. These three interstrand interactions occur periodically and are 
depicted in Figure 2. 

The strong N-0 hydrogen bonds occur between residues on adjacent strands. The non-H-bonded in- 
teractions alternate with the strong hydrogen bonds along adjacent /3-strands. The weak C a -0 hydrogen 
bonds are displaced one residue along adjacent /3-strands. Recent studies by Ho and Curmi on soluble 
/3-sheets have shown that these weak hydrogen bonds have a significant impact on the arrangement of 
residues on adjacent strands, such that bonded residues are not immediately across from each other, but 
slightly displaced along the axis of the strand [33]. This effect may be responsible for the twisting of 
/3-sheets. The role of weak C a -0 H-bonds has also been well studied in helical membrane proteins [34]. 

The three types of interactions used in this study are defined by their relationship to the backbone 
hydrogen-bonding pattern in /3-sheets. Unless otherwise specified, we use "strong H-bond" to refer to 
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Table 2: Pairwise interstrand spatial motifs and antimotifs with propensities and p- values by interaction type. Only 
motifs significant at the threshold p-value of 0.05 are listed. Obs. = observed number of pairs, Exp. = expected 
number of pairs calculated as described in methods. 

a pair of residues that share strong N-O backbone H-bonds, "non-H-bonded interaction" to a pair of 
residues on adjacent strands that interact but do not share backbone H-bonds, and "weak H-bond" to a 
pair of residues that share weak C-O backbone H-bonds (Figure 2). 

2.2.2 Interstrand pairwise interaction propensity, spatial motifs and antimotifs 

We study interstrand interactions to identify spatial patterns in the form of pairing residues across 
strands with a strong preference to form hydrogen bonds, non-H-bonded interactions, and weak C Q -0 
hydrogen bonds. We calculate TransMembrane Strand Interaction Propensity (Tmsip) scales, which are 
the values of odds ratios comparing the observed frequency of interstrand contact interactions to the 
expected frequency. Among these contact interactions, we define spatial motifs as two-body interactions 
whose propensity is greater than 1.2 {i.e. favored) and statistically significant, and spatial antimotifs 
as two-body interactions whose propensity is less than 0.8 {i.e. disfavored) and statistically significant. 
Table 2 lists the Tmsip values and the p- values of motifs and antimotifs for each interaction type (strong 
H-bonds, non-H-bonded interactions, and weak H-bonds) at the significance level of p < 0.05. A full list 
of all residue pairs is included in Supplementary Material. 

For strong backbone H-bond interactions, small residues and aromatic residues often have a strong 
propensity to interact, such as G-Y and G-F. Polar-polar residue interactions are also strongly favored, 
including N-D and K-S. In addition, branched hydrophobic residues and aromatic residues frequently form 
spatial motifs (I-Y, L-W, and L-Y). In contrast, aromatic-aromatic interactions are strongly disfavored 
for strong H-bond interactions: Y-Y is an antimotif, and W-Y is also strongly disfavored (propensity 
0.21, p- value 0.06). This likely reflects the effects of bulky aromatic side-chains. 

For non-backbone-H-bonded interactions, the W-Y motif stands out as the one with the most sig- 
nificant p- value among all spatial motifs (propensity 2.71, p- value 4xl0 -7 ). Small residues form spatial 
motifs with branched residues and Gin (G-I, G-V, Q-G, and A-V). Other motifs of non-H-bonded inter- 
actions involve Leu (L-L and L-P). 

Weak H-bond interactions have fewer spatial motifs. These involve charged residues (D-T, E-M, and 



7 




Figure 3: Clustering of residues based on single and pairwise propensities. G and P are assigned to their own clusters. 
The remaining three clusters are Polar (N, D, H, C, M, R, K, Q, T, E, and S), Aliphatic (A, I, L, and V), and Aromatic 
(Y, F, and W). 



D-P), or Pro residues (G-P and D-P). 

2.2.3 Propensity and spatial motifs using a reduced alphabet 

As seen in Table 2, residues of similar shape or physicochemical properties often behave similarly in 
forming high propensity spatial interactions. It is known that a reduced alphabet of fewer than 20 
types of amino acid residues is often adequate for a protein to perform its biological functions [35], for 
computational recognition of homologous protein sequences [36] , and for recognition of native structures 
from decoys [37] . A reduced alphabet would also be useful for statistical studies using limited data, as it 
would increase statistical significance for groups of related amino acids. 

To further understand the nature of interstrand propensity and spatial motifs, we recalculated in- 
terstrand propensity using a simplified amino acid alphabet. An objective way to obtain a reduced 
alphabet is to group amino acids with similar location preference and strand pairing propensities. We 
cluster amino acids using the results obtained in the single and two-body studies, as described in Supple- 
mentary Material. The resulting clusters forming the reduced alphabet (Figure 3) are as follows: Gly by 
itself, Pro by itself, polar residues (Arg, Asn, Asp, Cys, Gin, Glu, His, Lys, Met, Ser, and Thr), aliphatic 
residues (Ala, He, Leu, and Val), and aromatic residues (Phe, Trp, and Tyr). Odds ratios and p- values 
were then obtained for strand interactions between residue clusters using the same method for the full 
amino acid alphabet. Table 3 summarizes the results obtained using this reduced alphabet. 

Several general patterns emerge. The most dramatic differences in strand pairing between different 
types of interactions are those between strong H-bonds and non-H-bonded interactions. Gly- Aromatic 
and Aliphatic- Aromatic pairs have much higher preferences for strong H-bonds than for non-H-bonded 
interactions, while Aromatic- Aromatic and Gly- Aliphatic pairs have higher preferences for non-H-bonded 
interactions. Smaller differences are detected for Polar-Aliphatic pairs (higher in strong H-bonds) and 
for Polar- Aromatic and Aliphatic- Aliphatic pairs (higher in non-H-bonded interactions). Weak H-bond 
interactions did not show very strong preferences in this analysis, except for the G-P pair already reported 
in Table 2. 

The high-propensity Aromatic- Aromatic non-H-bonded interactions include W-Y, the highest-propensity 
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Table 3: Odds ratios for interstrand residue pairs using a reduced amino acid alphabet. Residue clusters forming 
the reduced alphabet are defined in Figure 3. p- value entries listed as "— " were not significant at the threshold of 
0.05. Aliph. = Aliphatic, Arom. = Aromatic, Obs. = observed number of pairs, Exp. = expected number of pairs 
calculated as described in methods. 




Figure 4: Two examples of interstrand spatial motifs in /3-barrel membrane proteins: a) An instance of the WY non- 
H-bonded interaction motif in LamB. The aromatic side-chains of Trp and Tyr show considerable contact interaction, 
b) An instance of the GY strong H-bonded interaction motif in NspA. The protein has been tilted to show the motif 
on the internal side of the barrel. The aromatic side-chain of Tyr interacts with the Gly residue on the adjacent 
strand. This is an example of "aromatic rescue." 

pairwise spatial motif. An example of the W-Y motif (Figure 4a) may help to explain why this family 
of interactions has such high propensity. The Tyr and Trp residues involved in this motif often align 
their side-chains to maximize contact between their aromatic rings and between their polar groups. At 
the same time, the bulky side-chains cause steric hindrance preventing direct backbone N-O hydrogen 
bonds, and thus these pairs are disfavored in strong backbone H-bond interactions. 

2.2.4 Internal and external preference of pairwise interactions 

We find that there are clear preferences for some residue pairs to face either the outside or inside of 
the barrel. Pairs of bulky aromatic residues (Phe, Tyr, and Trp) are almost always on the outside of 
the barrel, due to the steric hindrance in the barrel interior that would result. All 26 of the strong 
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H-bonded Aromatic- Aromatic interacting pairs occur on the outside of the barrel, as do 55 of the 62 
Aromatic- Aromatic pairs that form non-H-bonded interactions. The preference of Gly-Gly pairs to face 
the inside of the barrel is also significant for non-H-bonded interacting pairs (17 out of 19 occur on the 
inside). However, no clear preference is seen for Gly-Gly strong H-bonds: Only 8 of 16 pairs occur on the 
inside. For strong H-bonded pairs between Gly and aromatic residues, 48 out of 63 pairs are internal, 
whereas there is no preference for non-H-bonded pairs (14 out of 25 are internal). This may reflect a 
preference for "aromatic rescue" of Gly, discussed below. The orientation for weak H-bonded pairs is not 
relevant, since one residue always faces the inside of the barrel and the other always faces the outside, 
as shown in Figure 2. 

2.2.5 Comparison to soluble barrel and barrel-like /3-sheets 

In order to determine whether the pairwise propensities calculated for TM /3-barrels were due to the effects 
of the transmembrane environment or due to the intrinsic structural effects of /3-barrels, we calculated 
pairwise propensities for a set of soluble /3-barrels and barrel-like /3-sheets that were structurally similar 
to the TM /3-barrel dataset as determined by the Combinatorial Extension structural alignment method 
[38]. The soluble dataset contains 28 proteins. All of the proteins contain anti-parallel /3-sheets with 
a right-hand twist, similar to TM /3-barrels. Some open /3-barrels and other /3-sheets were included to 
ensure that the size of the dataset would be similar to that of the TM /3-barrels. The final dataset 
has approximately the same number of strand pairs, though about 40% fewer residue pairs. Significant 
pair propensities for the soluble /3-sheet dataset for each interaction type are listed in Supplementary 
Material. 

Only two motifs appear in both TM and soluble proteins: G-V non-bonded pairs (propensity 1.60 
in TM /3-barrels, 1.78 in soluble /3-sheets) and G-P weak H-bonds (1.78 TM, 3.75 soluble). In soluble 
/3-sheets, polar-polar and hydrophobic-hydrophobic pairs have high propensities for strong H-bond and 
non-H-bonded pairings, while polar-hydrophobic pairs have low propensities. The results for weak H- 
bonds show an opposing trend. The strongest motifs in TM /3-barrels, W-Y non-H-bonded pairs and 
G-Y strong H-bonds, are not statistically significant in soluble sheets. W-Y non-H-bonded pairs have 
propensity 1.06 and G-Y strong H-bonds have propensity 1.27 with p-value 0.55. 

Pairwise interhelical interactions in TM a-helical membrane proteins have also been studied using 
odds ratios [23]. The only favorable interaction motif found in both TM a- helices and TM /3-barrels is 
G-F, a strong H-bond motif (odds ratio 1.80) involved in aromatic rescue. Otherwise, there are very few 
similarities between the two protein families. 

2.3 Prediction of strand register 

The single and pairwise propensities can be useful for structure prediction. We have developed an 
algorithm to predict strand register in /3-barrel membrane proteins knowing only amino acid sequences 
and the approximate positions of the first residue in each strand. The latter can be obtained by several 
strand predictors, though they cannot themselves predict strand register. We use the hidden Markov 
model predictor of Bigelow et al. [7] . We also test our method when the actual strand starts are given 
beforehand. 

Our predictor consists of two steps. In step 1, we use the single-body regional propensities to refine 
the approximate strand starts given by the HMM predictor into "true" strand starts. It is only necessary 
for the strand predictor to predict strand starts within 5 residues of the exact strand start. In step 2, we 
use the two-body interstrand propensities (Tmsip) to predict final strand register for each strand pair. 
Details of the algorithm are discussed in Methods. 

The results of our predictor are reported in Table 4, where the proportion of strand pairs (out of 
256) whose register is correctly predicted is listed. When the strand starts predicted by the method of 
Bigelow et al. are given as input, the accuracy is 44%. When the actual strand starts from the PDB 
structures are given as input instead, the accuracy is 46%. This shows that only approximate strand 
starts are necessary for strand registration prediction. 

We compare our accuracy to a random control, in which we randomly select a window from the 11 
available positions for each strand and then randomly select a strand shear from the 11 possible values 
for each strand pair (as described in Methods). We ran this random selection 10 times on the dataset, 
and only 17.4 strand pairs were predicted on average (standard deviation 3.61). This translates to a 
random accuracy of 7%. Our accuracy of 44% is thus considerably better than random. 
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Method 


Strands 


Accuracy 


Full prediction 


112 


44% 


Step 1 only 


100 


39% 


Step 2 only 


92 


36% 


No prediction 


61 


24% 


Random 


17 


7% 



Table 4: Prediction of strand register in /3-barrel membrane proteins. We report the success of our tests as the number 
of strand pairs whose register is correctly predicted ("Strands") and the proportion out of the 256 strand pairs tested 
("Accuracy"). Listed are the accuracies for using both steps (single and pairwise propensities) as described in Methods 
("Full prediction"), using step 1 or 2 individually, skipping both steps ("No prediction"), or using randomly generated 
strand registers ( "Random" ) . 



We also compare our results to a similar prediction of strand register in soluble /3-sheets by Steward 
and Thornton [19]. This prediction uses the sum of single and pairwise scores to align one /3-strand 
against a fixed strand that it is known to pair with, and achieves an accuracy of 31% in antiparallel 
/3-sheets. These conditions are similar to ours when the actual strand starts are given as inputs and we 
skip step 1. We achieve an accuracy of 68% in TM /3-barrels under these conditions. 



3 Discussion 

We have described residue preferences for different regions and a number of spatial patterns for /3-barrel 
membrane proteins. Many of the novel patterns discovered in this study provide useful information about 
the assembly and folding process of /3-barrel membrane proteins. 

"Positive-outside" rule. Basic residues have an asymmetric distribution between the two cap 
regions of /3-barrel membrane proteins. Arg and Lys have propensities of 0.61 and 0.86 in the periplasmic 
cap region, respectively, but 1.54 and 1.54 in the extracellular cap region, and thus have over twice the 
preference to be found in the extracellular cap region than in the periplasmic cap region. This is notable 
for several reasons: first, one would intuitively expect charged residues to have high propensity in both 
cap regions, since they are surrounded by an aqueous environment; second, acidic residues have high 
and relatively similar propensities in both cap regions (Asp has a propensity of 1.83 for the periplasmic 
cap and 1.53 for the extracellular cap, and Glu has 1.20 and 1.32 for these regions, respectively); third, 
despite their poor propensity for the periplasmic cap, basic residues have very high propensities for the 
internal face of the periplasmic headgroup region (Arg 1.62, Lys 2.37), and often "snorkel" away from 
the center of the lipid bilayer and extend into the periplasmic space. 

A similar asymmetric distribution of basic residues in cap regions has been found in a-helical mem- 
brane proteins, yet the distribution is reversed: Basic residues have a high propensity for the inner, 
cytoplasmic cap, and a low propensity for the outer cap (facing the extracellular space in eukaryotes and 
gram-positive bacteria or facing the periplasm in gram-negative bacteria) [24] . This property was named 
the "positive-inside" rule. By analogy, /3-barrel membrane proteins thus show a "positive-outside" rule. 

A possible explanation for this result is that the asymmetric distribution of basic residues is related 
to the asymmetric lipid composition of the outer membrane. The inner leaflet of the outer membrane, 
which faces the periplasm, is composed primarily of phosphatidylethanolamine (PE), while the outer 
leaflet, which faces the extracellular environment, has a high concentration of lipopolysaccharide (LPS), 
which bears several negatively charged groups. The asymmetric distribution of basic residues within the 
protein may lead to a stable positioning of the protein within this asymmetric lipid bilayer. It has been 
suggested that /3-barrel membrane proteins may interact with LPS in the periplasm before co-inserting 
into the outer membrane [42] . If this is true, the high propensity of basic residues on the extracellular 
side of the protein would attract the highly negatively-charged LPS so that the protein will insert in the 
correct direction. In a recent study in which the structure of the TM /3-barrel of FhuA is co-crystallized 
with one LPS molecule, LPS binds strongly to basic residues on FhuA in the extracellular cap region, 
forming hydrogen bonds or ionic interactions [43,44]. Another study suggests that the frequency of 
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positively charged residues in the headgroup region is correlated to in vitro affinity of /3-barrel membrane 
proteins for different phospholipids. [45] . Thus, the affinity of the two sides of the barrel for the two outer 
membrane leaflets may be affected by the amino acid composition of the barrel and the lipid composition 
of the leaflets. 

Aromatic rescue of glycine in /3-barrel membrane proteins. Tyr-Gly is the most favorable 
interstrand motif for backbone H-bond interactions (propensity 1.56, p- value 8xl0~ 4 ). Of the 39 G- 
Y strong H-bonded pairs in the dataset, 32 are on the inside of the barrel, where Tyr is normally 
disfavored. Tyr adopts an unusual rotamer (60,90) in this location: 60% of internal Tyr side-chains 
adopt this rotamer, while only 8% of external Tyr side-chains and 6% of Tyr side-chains in soluble (3- 
strands do [40]. There is a single explanation for this rotamer preference and the unusual preference of 
Tyr for the inside of the barrel when it is part of a G-Y interaction. As shown in Figure 4b, the side-chain 
of Tyr has extensive contacts with and covers the Gly residue to which it is backbone H-bonded. Of the 
32 internal G-Y strong H-bonded pairs, 30 show this behavior, but only 1 of 7 external G-Y pairs do. 

Merkel and Regan discovered the same behavior in soluble /3-sheets, and named it "aromatic rescue" 
of glycine [20]. They found that aromatic rescue mitigates the instability Gly causes in /3-sheets. They 
proposed that the aromatic side-chain prevents the exposure of the backbone around Gly to solvent while 
at the same time minimizing the surface area of the aromatic ring exposed to solvent. This effect also 
accounts for the fact that this pattern occurs preferentially on the inside of the barrel in TM regions, 
where it is exposed to solvent or a polar environment. Phe behaves similarly to Tyr in this region. G-F 
is a significant strong H-bonded motif (propensity 1.80, p- value 5xl0 _ ) and displays aromatic rescue 
on the inside of the barrel. Likewise, the (60,90) rotamer is observed in 83% of internal Phe side-chains, 
but only in 19% of external Phe side-chains and 5% of Phe side-chains in soluble /3-strands [40]. 

Hutchinson et al. studied the relationship between H-bonded and non-H-bonded pairs in soluble 
/3-sheets, and found that G-F H-bonds were 2.63 times more likely than G-F non-H-bonded pairs [21]. 
This is similar to the ratio for TM /3-barrels (2.37). However, the ratio of G-Y for TM /3-barrels (2.93) is 
considerably higher than that for soluble /3-sheets (1.39). Tyr may have a much larger role in aromatic 
rescue in TM /3-barrel membrane proteins than soluble /3-sheets because of the higher polarity of Tyr 
side-chains compared to Phe. Aromatic rescue may be even more important for the internal surface of 
TM /3-barrels than for soluble /3-sheets, both because of the need to have polar residues such as Tyr 
on the internal surface and the need to have Gly in order to relieve the steric pressure caused by the 
curvature of the barrel. 

Structure prediction. We have incorporated our propensity values into a preliminary study pre- 
dicting strand pair register in TM /3-barrels. We find that our predictor can achieve significantly higher 
accuracy than random. It can achieve nearly the same accuracy by inputting strand starts approximated 
by the hidden Markov model predictor of Bigelow et al. [7] as when the true strand starts are known. 
Thus, only the amino acid sequence is necessary for strand pairing prediction if the method of Bigelow 
et al. is used first. Our accuracy is considerably better than random (44% vs. 7%), and is also better 
than just relying on the initial HMM prediction (24%). The predictor discussed in this study represents 
the first step towards a full structure prediction of TM /3-barrels. 

We also assessed the effects of using only knowledge of strand starts (step 1) or only strand pairing 
(step 2) . We can skip step 1 by not refining the strand starts given as input ( "Step 2 only" in Table 4). We 
find that the accuracy of the predictor decreases to 36% using start positions from the HMM predictor. 
This shows that the single-body regional preferences improve accuracy by 8%. We can skip step 2 ("Step 
1 only" in Table 4) by not using our pairwise interstrand propensities, and simply assigning the most 
frequently observed strand register as correct. In this case, the accuracy decreases to 39%. Clearly, both 
steps 1 and 2 are necessary to achieve maximal accuracy. When both steps 1 and 2 are skipped ("No 
prediction" in Table 4), the accuracy drops to 24%. 

Determining factors of folding and assembly. There are two main factors that determine the 
spatial patterns described in this study. As with a-helical membrane proteins, the folding and assembly 
of /3-barrel membrane proteins in the transmembrane region are governed by the thermodynamics of 
lipids and proteins and their interactions. On the other hand, the sorting and targeting process required 
for integrating membrane proteins into lipid bilayers in the correct direction after biosynthesis is an 
important process of the cell, and involves complex biomolecular machinery [12]. For helical membrane 
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proteins, the nature of such machinery is being uncovered by the study of translocons [47]. For /3-barrel 
proteins, recent work on chaperone-assisted folding hint at a very complex machinery as well [12]. 

Distinguishing spatial patterns due to fundamental thermodynamics from those due to the unique 
requirements of biological localization is a challenging task. The identification and understanding of the 
origins of these motifs will help to elucidate the folding mechanisms of /3-barrel proteins. An attractive 
hypothesis is that these two types of events can be discerned from further analysis of propensities, motifs, 
and antimotifs. 

Experimentally testable hypotheses. The identification of favorable pairing residues may be 
useful for suggesting experimental studies to elucidate determinants important for in vivo folding and to 
dissect the determinants of thermodynamic stability for /3-barrel membrane proteins. For example, Arg 
and Lys can be introduced in the periplasmic cap region, and if such mutants can fold with similar stability 
in in vitro experiments, this would suggest that the "positive-outside" rule may be important mostly 
for in vivo folding. In addition, measurement of the thermodynamic stabilities of /3-barrel proteins in 
reconstituted lipid bilayers of different composition can help to clarify whether the origin of the "positive- 
outside" rule is due to the asymmetric composition of phospholipid bilayers. 

Gly-Tyr interstrand pairs involved in aromatic rescue may serve as anchoring sites of /3-barrel mem- 
brane protein folding. To study details of the folding mechanism, one could remove and add anchoring 
Gly-Tyr residue pairs at strategic positions simultaneously in order to introduce maximal changes in the 
effective contact orders, which would significantly alter the zipping process of folding [49]. Folding rate 
studies of these mutated proteins would be valuable for elucidating the folding mechanism of /3-barrel 
proteins in general and the role of the high-propensity Gly-Tyr interaction in particular. The motifs, 
antimotifs, and spatial high propensity pairing described in this study can be used profitably for studying 
the folding and assembly mechanism of /3-barrel membrane proteins. 

Importance of statistical analysis for small datasets. Previous studies of residue pairing in 
soluble /3-sheets either lack a statistical model and hence provide no statistical significance in their results, 
or use simple models such as the \ 2 test f° r p- value calculation, which is inappropriate for studying the 
pairing of very short strands [18, 19, 21]. 

Because no existing textbook or published statistical models are applicable for analyzing short strand 
pairing, the key component of our method is the development of methods of exact calculation of p- values 
for the null model of exhaustive permutation of strand sequences. With the development of a higher-order 
generalized hypergeometric model based on the combinatorics of short sequences, we are able to identify 
very significant motifs and antimotifs from very limited data. The development of rigorous statistical 
methods is an important technical development of work reported in this paper. 

4 Model and Methods 

We sketch briefly our models and computational methods below. More details can be found in Supple- 
mentary Material). 

Database. The dataset used to derive the statistical models comprises 19 /3-barrel membrane proteins 
found in the Protein Data Bank (Table 5), totaling 262 /3-strands. All proteins share no more than 26% 
pairwise sequence identity. All structures have a resolution of 2.6 A or better. 

The dataset of soluble barrel and barrel-like /3-sheets was compiled by searching structural homologs 
using the Combinatorial Extension method [38] with each of the 19 proteins in the TM dataset. Because 
the resulting dataset was small, CE was used again on the two closed barrels obtained from the first 
attempt, streptavidin (pdb lstp) and retinol binding protein (pdb lbrp). As with the TM dataset, all 
proteins share no more than 26% pairwise sequence identity. All structures have a resolution of 3.3 A 
or better. The dataset comprises 28 soluble /3-sheets: lavg, layr, lbbp, lbj7, lbpo, lbrp, ldfv, ldmm, 
ld2u, leg9, lei5, lepa, lem2, lewf, lfsk, lfx3, lb.91, ljkg, llkf, lm6p, lqfv, lstd, lstp, lt27, luna, 
2a2u, 3blg, 4bcl. 

Single-body propensities. We define the single-body propensity P r (X) of residue type X in region 
r as the odds ratio comparing the frequency of a residue type in one region to its expected frequency 
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Table 5: Dataset of 19 /3-barrel membrane proteins used for this study. All proteins share no more than 26% pairwise 
sequence identity. Crystal structures have a resolution of 2.6 A or less. Three identical chains of TolC and seven of 
a-hemolysin form a single barrel; all other proteins listed form whole barrels with a single peptide chain. 



when all eight regions are combined: 



E[f'(X\r)Y 

where f(X\r) is the observed frequency (number count) of residue type X in region r, and E[/'(X|r)] is 
the expected frequency of residue type X in region r. 

In order to calculate E[/'(X|r)], we use a random null model of exhaustive permutation of all residues 
in all eight regions. Here each permutation occurs with equal probability. The probability Px|r(*) 
of i = f'(X\r) residues of type X being assigned to region r follows a hypergeometric distribution: 
^x\r(i) = h(i\n, n r ,n x ) = (™ x ) (^""^/(^ ) , where n is the number of residues of all types in all eight 
regions in the entire dataset, n r is the number of residues of all types in region r, and n x is the number 
of residues of type X in all regions. The expected frequency of residue type X occurring in region r is 
the mean of the hypergeometric distribution h(i\n, n r , n x ): 

E[/'(^k-)] = E i - p ^k( i ) = ! ^r i - (!) 

Therefore, the propensity P r (X) is: 

p r{x) _ f(X\r) _ f(X\r)/n r 



E[f'(X\r)] n x /n ' 

Because the frequency of occurrences of residues of type X in region r in the null model follows 
a hypergeometric distribution, we can calculate an exact p-value for an observed f(X\r) to assess its 
statistical significance. We calculate a two-tailed p-value based on the null hypothesis that P r (X) = 1.0. 

Determination of pairwise contacts. The three types of pairwise contacts (strong H-bond, side- 
chain, and weak H-bond) were assigned for interacting residues based on contact types defined by the 
DSSP program (Definition of Secondary Structure of Proteins [50]) based on the atomic coordinates from 
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PDB files. For /3-sheets, DSSP defines bridge partners as residues across from each other on adjacent 
/?-strands, and also determines whether bridge partners interact via backbone N-O H-bonds. Two TM 
residues on adjacent /3-strands contribute to a backbone strong H-bond interaction if they are listed as 
bridge partners that contribute to backbone H-bonds. They contribute to a non-H-bonded interaction 
if they are listed as bridge partners but do not contribute to backbone H-bonds. They contribute to a 
weak H-bond if the residue with the smaller residue number (i.e. closer to the N-terminus of the protein) 
is a bridge partner to the residue immediately following the other residue of the pair. This is because 
a weak Cc-0 H-bond extends by one residue in the N-C direction, as seen in Figure 2. Residues j and 
y contribute to a weak H-bond in Figure 2, for example, because the residue with the lower number (y, 
since it is closer to the N-terminus), is a bridge partner to i, the residue immediately following j. For 
a strand pair between the first and last strands of a protein, the larger and smaller numbered residues 
must be switched, to account for the fully circular nature of /3-barrels. Residues outside the TM regions 
(core and headgroup) were not considered in our calculations of pairwise contacts. 



Interstrand two-body spatial contact propensities. The interstrand pairwise propensity 
(Tmsip) P(X, Y) of residue types X and Y for each of the three types of pairwise contacts is given 
by: 

P(Y Y) = f( X ' Y ^ 
1 ' ' E[f'(X,Y)]' 

where f(X, Y) is the observed frequency of X-Y contacts of a specific type in the TM regions (core and 
headgroup), and E[f' (X,Y)] is the expected frequency of X-Y contacts in a null model. In Tables 2 and 
3, f(X, Y) is listed under "Obs.", and E[f'(X, Y)] is listed under "Exp." 

In order to calculate E[f'(X, Y)], we choose a null model in which residues within each of the two 
adjacent strands in a strand pair are permuted exhaustively and independently, and each permutation 
occurs with equal probability. 

Contacts between residues of the same type. When X is the same as Y, the probability fx,x{i) of i = 
f'(X, X) number of X-X contacts in a strand pair follows a hypergeometric distribution. E a n[/'(X, X)] 
is then the sum of the expected values of f'(X, X) for the set SV of all strand pairs in the dataset: 



E aII [/'(x,x)]= J2 nfs P (x,x)]= 



xi(sp) ■ x 2 (sp) 



l(sp) 

spesv spesv v ' 

where xi(sp) and X2{sp) are the numbers of residues of type X in the first and second strand of strand 
pair sp £ SV, respectively, and l(sp) is the length of strand pair sp. The right-hand side is determined 
by using the expectation of the hypergeometric distribution, analogous to Equation (1). For statistical 
significance, two-tailed p- values can be calculated using Vx,x- 

Contacts between residues of different types. If the two contacting residues are not of the same type, 
i.e. X Y, then the number of X-Y contacts in the random model for one strand pair is the sum of two 
dependent hypergeometric variables, one variable for type X residues in the first strand and type Y in 
the second strand, and another variable for type Y residues in the first strand and type X in the second 
strand. The expected frequency of X-Y contacts E[f'(X, Y)] is the sum of the two expected values over 
all strand pairs sp £ SV: 

nf(x,Y)]= £ { nfux,Y) ]+ nfUY,x)]}= £ + 

spesv spesv l(sp > l(sp > 

where X\(sp) and X2{sp) are the numbers of residues of type X in the first and second strand, yi(sp) 
and 2/2 (sp) are the numbers of residues of type Y in the first and second strand, and l(sp) is the length 
of strand pair sp. The right-hand side is determined by using the expectation of the hypergeometric 
distribution, analogous to Equation (1). A more complex hypergeometric formula for the null model is 
used for exact calculation of p- values (see details in Supplementary Material). 

Strand register prediction. Our algorithm consists of two steps, the prediction of exact strand 
starts and the prediction of strand register. In both steps, a probabilistic model is used which involves 
the summation of log- propensity values calculated in this study. 

For step 1, we use the single-body regional propensities. We begin with an initial postion of strand 
start, e.g., as predicted by the hidden Markov model of Bigelow et al. [7], and sum the log-propensities 
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of the surrounding amino acids for 11 windows: the window starting at the predicted strand start, and 
all windows within 5 residues from the original prediction. The window with the sum that reflects the 
highest probability is taken as the "true" strand start used in step 2. 

For step 2, we use the pairwise two-body propensities. We begin with two adjacent strands, using the 
strand starts predicted in step 1. We then shift one strand against another, and sum the log-propensities 
of the surrounding residue pairs for 11 windows (called "strand shears"): the window in which the two 
strand starts are bonded together, and all windows within 6 residues up and 4 residues down for the 
original position. This is because a strand shear of +1 is the most commonly occurring strand shear in 
the dataset. The window with the sum that reflects the highest probability is taken as the correct strand 
pairing. 

We exclude from our prediction analysis two proteins in our dataset (6 strand pairs) for which the 
HMM strand start predictor failed: TolC and a-HL. This leaves our dataset with 17 proteins comprising 
256 strands. We test our method in "leave-one-out" fashion, using 16 proteins to derive the log-propensity 
scores use to predict strand pairing in the 17th protein. The results are listed in Table 4. 
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6 Supplementary Material 

This is a self-containing expaned section on methods. 

Database. The dataset used to derive the statistical models comprises 19 /3-barrel membrane proteins 
found in the Protein Data Bank (Table 5), totaling 262 /3-strands. All proteins share no more than 26% 
pairwise sequence identity. All structures have a resolution of 2.6 A or better. 

The dataset of soluble barrel and barrel-like /3-sheets was compiled by searching structural homologs 
using the Combinatorial Extension method [38] with each of the 19 proteins in the TM dataset. Because 
the resulting dataset was small, CE was used again on the two closed barrels obtained from the first 
attempt, streptavidin (pdb lstp) and retinol binding protein (pdb lbrp). As with the TM dataset, all 
proteins share no more than 26% pairwise sequence identity. All structures have a resolution of 3.3 A 
or better. The dataset comprises 28 soluble /3-sheets: lavg, layr, lbbp, lbj7, lbpo, lbrp, ldfv, ldmm, 
ld2u, leg9, lei5, lepa, lem2, lewf, lfsk, lfx3, lb.91, ljkg, llkf, lm6p, lqfv, lstd, lstp, lt27, luna, 
2a2u, 3blg, 4bcl. 

Spatial Regions and Strand Model. In order to classify residues in /3-barrel membrane proteins 
by specific spatial regions (core, headgroup, and polar caps), the coordinates in the protein's PDB file 
were translated and rotated so that the xy-plane was perpendicular to the vertical axis of the barrel and 
equidistant to the observed aromatic girdles presumed to be at the membrane interfaces. An example of 
such a transformation is illustrated in Figure 1. Each residue in the protein was assigned a region based 
on the z-coordinate of its associated a-carbon. 

In addition, TM residues (those in the core or headgroup regions) were assigned as internal (i.e. their 
side-chains face into the center of the barrel) or external (i.e. their side-chains face away from the center 
of the barrel), depending on the angle between the vector from the barrel central axis to the a-carbon and 
the vector from the a-carbon to the /3-carbon of the residue. If the angle is less than 90° , it is classified 
as internal; if the angle is greater, it is classified as external. If the residue is glycine, its orientation is 
extrapolated as the opposite of the residue previous to it on the /3-strand. 

In total, each barrel is divided into 8 distinct regions: periplasmic cap with z 6 (— 20.5A, — 13.5A), 
periplasmic headgroup (internal and external) with z £ (— 13.5A, — 6.5A), core (internal and external) 
with z € (— 6.5A, 6.5A), extracellular headgroup (internal and external) with z 6 (6.5A, 13. 5A), and 
extracellular cap with 2 € (13. 5A, 20. 5A). Residues that are not assigned to be in /3-strands are auto- 
matically excluded from the TM (headgroup and core) regions. Unless noted, any residue in the protein 
not assigned to any of these regions was excluded from the calculation. 
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Single-body propensities. We define the single-body propensity P r (X) of residue type X in region 
r as the odds ratio comparing the frequency of a residue type in one region to its expected frequency 
when all eight regions are combined: 

P (X) - /(x|r) 

where f(X\r) is the observed frequency (number count) of residue type X in region r, and E[/'(X|r)] is 
the expected frequency of residue type X in region r. 

In order to calculate E[/'(X|r)], we need a random null model. Here we chose as our model exhaustive 
permutation of all residues in all eight regions, such that each permutation occurs with equal probability. 
That is, the residues in all eight regions of all proteins in the dataset are permuted exhaustively, without 
replacement, and assigned regions based on their new positions. For each permutation, f'(X\r) records 
the number of occurrences of residue type X in region r. Under this model, the probability fx\r(i) of 
i = f'(X\r) residues of type X being assigned to region r follows a hypergeometric distribution: 

F x \ r (i) = h(i\n,n r ,n x ) = - " r ~' , 

\n r / 

where n is the number of residues of all types in all eight regions in the entire dataset, n r is the number 
of residues of all types in region r, and n x is the number of residues of type X in all regions. Analogously, 
we can think of F x \ r (i) as the probability of selecting without replacement n r residues out of a total of 
n residues contained in an urn, such that i of the residues are of type X. The expected frequency of 
residue type X occurring in region r is the mean of the hypergeometric distribution h(i\n, n r , n x ): 

E[f'(X\r)] = ^i.W xlr (i) = ^^. (2) 

i = 

Therefore, the propensity P r (X) is: 

f(X\r) 



f(X\r) 



Pr{X) = 



E[/'(A»] 



That is, P r (X) is the ratio of the proportion of residue type X in region r to the proportion of residue 
type X in all eight regions combined. 

Because the frequency of occurrences of residues of type X in region r in the null model follows 
a hypergeometric distribution, we can calculate an exact p-value for an observed f(X\r) to assess its 
statistical significance. We calculate a two-tailed p- value based on the null hypothesis that P r (X) = 1.0. 
If the observed P r (X) < 1.0, then: 

f(X\r) 

P = 2- £ Px\r(j)- (3) 

i=0 

If the observed P r (X) > 1.0, then: 

X n 

P = 2- Yl p *kW- ( 4 ) 

i=f(X\r) 

That is, the p-value is the probability that the propensity calculated from the dataset would deviate as 
much or more from the observed propensity, higher or lower, assuming that the actual propensity is 1.0. 



Determination of pairwise contacts. The three types of pairwise contacts (strong H-bond, side- 
chain, and weak H-bond) were assigned for interacting residues based on contact types defined by the 
DSSP program (Definition of Secondary Structure of Proteins [50]) based on the atomic coordinates from 
PDB files. For /3-sheets, DSSP defines bridge partners as residues across from each other on adjacent 
/3-strands, and also determines whether bridge partners interact via backbone N-O H-bonds. Two TM 
residues on adjacent /3-strands contribute to a backbone strong H-bond interaction if they are listed as 
bridge partners that contribute to backbone H-bonds. They contribute to a non-H-bonded interaction 
if they are listed as bridge partners but do not contribute to backbone H-bonds. They contribute to a 
weak H-bond if the residue with the smaller residue number (i.e. closer to the N-terminus of the protein) 
is a bridge partner to the residue immediately following the other residue of the pair. This is because 
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a weak C a -0 H-bond extends by one residue in the N-C direction, as seen in Figure 2. Residues j and 
y contribute to a weak H-bond in Figure 2, for example, because the residue with the lower number (y, 
since it is closer to the N-terminus), is a bridge partner to i, the residue immediately following j. For 
a strand pair between the first and last strands of a protein, the larger and smaller numbered residues 
must be switched, to account for the fully circular nature of /3-barrels. Residues outside the TM regions 
(core and headgroup) were not considered in our calculations of pairwise contacts. 



Interstrand two-body spatial contact propensities. The interstrand pairwise propensity 
(Tmsip) P(X, Y) of residue types X and Y for each of the three types of pairwise contacts is given 
by: 

p(X Y) = f( X ' ^ 
{ ' ' E[f'(X,Y)}' 

where f(X, Y) is the observed frequency of X-Y contacts of a specific type in the TM regions (core and 
headgroup), and E[f' (X,Y)] is the expected frequency of X-Y contacts in a null model. In Tables 2 and 
3, f(X, Y) is listed under "Obs.", and E[f'{X, Y)] is listed under "Exp." 

In order to calculate E[f'(X, Y)], we choose a null model in which residues within each of the two 
adjacent strands in a strand pair are permuted exhaustively and independently, and each permutation 
occurs with equal probability. In this null model, an X-Y contact forms if in a permuted strand pair two 
contacting residues happen to be type X and type Y. E[f'(x,y)] is then the expected number of X-Y 
contacts over the entire dataset. 

Contacts between residues of the same type. When X is the same as Y, the probability Px,x(i) of 
i — f'(X,X) number of X-X contacts in a strand pair follows a hypergeometric distribution: 

V x , x (i)= 2 



where X\ is the number of residues of type X on the first strand, x 2 is the number of residues of type 
X on the second strand, and I is the length of the strand pair (the lengths of the two strands must be 
equal). This mimics the random selection of residues from one strand to pair up with residues from the 
other strand. 

This is the same as picking x 2 balls from an urn of I balls, i of which are of type X and X2 — i of 
which are not type X. In this case, the urn represents the first strand, containing xi residues of type X 
and I — x\ residues not of type X. The X2 balls picked from the urn represent the residues in the first 
strand selected to be placed adjacent to the X2 residues of type X in the second strand, i of which are of 
type X, and X2 — i of which are not of type X. 

E a n[/'(X, X)] is then the sum of the expected values of f'(X, X) for the set SV of all strand pairs in 
the dataset. 

xi(sp) ■ x 2 (sp) 



E M [f'(X,X)]= E SP [/'(X,X)]= Y, 



l(sp) 

where xi(sp) and X2(sp) are the numbers of residues of type X in the first and second strand of strand 
pair sp £ SV, respectively, and l(sp) is the length of strand pair sp. The right-hand side is determined 
by using the expectation of the hypergeometric distribution, analogous to Equation (2). For statistical 
significance, two-tailed p- values can be calculated using formulae similar to Equations (3) and (4). 

Contacts between residues of different types. If the two contacting residues are not of the same type, 
i.e. X 7^ Y, then the number of X-Y contacts in the random model for one strand pair is the sum of two 
dependent hypergeometric variables, one variable for type X residues in the first strand and type Y in 
the second strand, and another variable for type Y residues in the first strand and type X in the second 
strand. The expected frequency of X-Y contacts E[f'(X, Y)] is the sum of the two expected values over 
all strand pairs sp £ SV: 

nf(X,Y)]= £ {E [ f' sp (X,Y) ]+ E [ f sp (Y,X) ]} = £ { X - 1 ^^+ y -^S^^ 
spesv spesv l{sp> l{sp> 

where xi(sp) and X2(sp) are the numbers of residues of type X in the first and second strand, yi(sp) 
and ?/2 (sp) are the numbers of residues of type Y in the first and second strand, and l(sp) is the length 
of strand pair sp. The right-hand side is determined by using the expectation of the hypergeometric 
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distribution, analogous to Equation (2). Despite the fact that the variables f' ap (X,Y) and f' ap (Y,X) are 
dependent (i.e. the placement of an X-Y pair may affect the probability of a Y-X pair in the same strand 
pair), their expectations may be summed directly, because expectation is a linear operator. 

Generalized hypergeometric model. However, because f' sp (X,Y) and f' Bp (Y,X) are dependent, to 
determine p-values for a specific number of observed X-Y contacts, a more complex hypergeometric 
formula for the null model must be established. The probability of a specific number of X-Y contacts 
occurring in one strand pair does not follow a simple hypergeometric distribution. Here we develop a 
generalized hypergeometric model based on the trinomial coefficient to characterize such a probability. 
First, we have a 3-element trinomial function (o, b, c)! defined as: 



/ t \ § (a + b + c)\ 

{a,b,c)\ = i 

aW.ci 



a\b\c\ 

It represents the number of distinct permutations in a multiset of three different types of elements, with 
number count a, b, and c for each of the three element types. Consider residues in the first strand of 
length / of a strand pair. These / residues are of three types: xi count of type X residues, yi of type Y 
residues , and l — x\—y\ count of type "neither" . If we exhaustively permute the I residues, we have the 
trinomial coefficient number of different permutations. We denote this as: 

T(l,xi,yi) = (xi,yi,l - xi - yi)\. 

We now first fix the positions of residues on strand 1, and permute exhaustively all matching I residues 
on strand 2. Let X2,y2, and / — x 2 — y2 be the numbers of residue of type X, Y , and "neither" on strand 
2, respectively. The total number of permutations for strand 2 is: 

T(l, X2, 1/2 ) = (X2, y2, 1 - x 2 - 1/2)!. 

Consider the residues on strand 2 that match to the x\ number of residues of type X on strand 1. 
(This and all further descriptions are illustrated in Figure 5 for clarification.) These x\ residues on strand 
2 consist of h number of type X residues, i number of type Y residues, and xi — h — i number of type 
"neither" residues. They can be permuted in 

T(xi, h, i) = (h, i, xi — h — i)l 

different ways. By analogy, the y\ residues on strand 2 that match type Y residues in strand I consist 
of j number of type X residues, k number of type Y residues, and 1/1 — j — k of type "neither" residues, 
and thus the total number of permutations for these 3/1 residues is: 

T(yi,j, k) = {j,k,yi - j - k)l. 

Similarly, there are T(l — xi — yi, X2 — h — j, y2 — i — k) number of permutations to match the remaining 
I — xi — yi of type "neither" residues on strand f . 

We characterize the probability P(h,i,j,k) of interstrand matches: a) the xi type X residues on 
strand I with h type X residues, i type Y residues, and x\ — h — i type "neither" residues on strand 2; 
b) the y\ type Y residues on strand f with j type X residues, k type Y residues, and yi — j — k type 
"neither" residues on strand 2; and c) the remaining l — xi—yi type "neither" residues on strand 1 with 
X2 — h — j type X residues, j/2 — i — k type Y residues, and the remaining type "neither" residues from 
strand 2. Equivalently, P(/i, i,j, k) is the probability of h X-X contacts, i X-Y contacts, j Y-X contacts, 
and k Y-Y contacts occurring in a random permutation. 

We introduce a higher order hypergeometric distribution for ¥(h,i,j,k) as follows: 

w , ■ ■ ,n _ T(xi,h,i) ■ T(y u j, k)-T(l - xi - yi,x 2 -h-j,y 2 -i-k) 

if(n,i,j,K) — — — — . 

T(l,x 2 ,y 2 ) 

This can be illustrated as follows. When randomly picking X2 of type X residues, 1/2 of type Y residues, 
and the remaining I — X2 — y2 type "neither" residues from an urn for strand 2, we have: (f) those 
matching the xi residues of type X on strand f are of h number of type X, i number of type Y, and 
X\ — h — i of type "neither" ; (2) those matching the yi residues of type Y on strand I are of j number of 
type X, k number of type Y, and X2 — j — k of type "neither"; and (3) those matching the I — xi — yi 
residues of type "neither" on strand f are of X2 — h — j number of type X, 1/2 — i — k number of type Y, 
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X, 



i-\-y x 



x l —h—i 

j 
k 

y^-j-k 

x 2 -h-j 
y 2 -i-k 



Figure 5: Illustration of the null model for interstrand spatial motifs when X j^Y. White represents X residues, black 
Y residues, and grey "neither" residues {i.e. neither X nor Y). X-Y motifs are represented by the i residue pairs in which 
there is an X residue in the first strand and a Y residue in the second, and by the j residue pairs in which there is a Y 
residue in the first strand and an X residue in the second. 



and (Zi — xi — yi) — (X2 — h — j) — (3/2 — i — k) of type "neither". 

The marginal probability Px,y(m) that there are a total of i + j = m X-Y contacts in the random 
model, namely, the pairings where a residue of type X in the first strand is paired with a residue of type 
Y in the second strand, summed with the pairings in which a residue of type Y in the first strand is 
paired with a residue of type X in the second strand is: 

wi- 

x\ xi— h (m—i) 

Px,y(m) = ^ J2 Yl nh,i,m-i,k), 

h=0 i=0 fc=0 

where h is the number of matched X-X contacts, i the number of matched X-Y contacts, m—i the number 
of matched Y-X contacts(j in Figure 5), and k the number of matched Y-Y contacts. The remaining 
contacts involving residues of type "neither" will then automatically be assigned, since all matches 
involving X and Y have been accounted for. There are x\ possible values for h, one for each residue of 
type X on strand 1; x\ — h possible values for i, once h has been determined; and yi — j = y — (rn — i) 
possible values for k, once i has been determined. The i number of X-Y contacts plus the m — i number 
of Y-X contacts will sum to the m number of contacts desired. This closed-form formula allows us to 
calculate analytically the two-tailed p- value for this null model of f'(X, Y) number of observed X-Y 
contacts using formulae similar to Equations (3) and (4). 

Confounding between single-body propensity and interstrand two-body propensity. Because single-body 
propensities can vary significantly, it is possible that differences in two-body propensities may simply 
be reflections of differences in single-body propensities, e.g. two polar residues might have high strong 
H-bond pairwise propensities simply because both residues have an independent preference for the same 
side of the TM barrel (internal- facing) , and not because of any direct significant propensity between 
the two. This artifact can be eliminated by dividing each strand into two "substrands," each of which 
contains only residues facing the same direction. This correction is automatic for strong H-bonds and 
non-H-bonded interactions, as all of the residues participating in each of these interactions in a single 
strand pair face the same direction {e.g. the residues in a particular strand pair participating in a strong 
H-bond must either all be internal or all be external). For weak H-bond interactions, in which one 
residue is internal and one is external, each strand pair must be divided into two substrand pairs: one 
pair in which the first substrand is internal and the second is external, and another pair in which the 
first substrand is external and the second is internal. In this way, the often dominating effects of single- 
residue orientation are removed from two-body propensity calculation. Results reported in Tables 2 and 
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3 are obtained after these corrections. For the analysis performed on the soluble /3-sheet dataset, there 
is no strong distinction between internal and external residues, since only some of the proteins are closed 
barrels. Thus, no correction was used for weak H-bonds. 



Pairwise propensities for a reduced alphabet. To obtain an objective reduced alphabet of 
amino acids for studying membrane /3-barrel proteins, we cluster amino acids by their location preference 
and strand pairing propensities. We represent each amino acid as a vector and use hierarchical clustering 
to define residue groups. Each vector consists of 68 z-scores: one for each of the 20 pairwise contact 
propensities including the residue for each of the 3 interaction types (strong H-bonds, non-H-bondcd 
interactions, and weak H-bonds), and one for each single-body regional propensity of the residue in each 
of the 8 regions. The z-scores for pairwise propensities are calculated as 

.(ffl. MM , (5, 
Vvar[/'(X,y)] 

and the z-scores for single-body propensities are calculated as 

f(X\r)-E[f'(X\r)] 



z(X\r) = 



^vax[f'(X\r)} 



We use hierarchical clustering by average linkage with a Euclidean distance function to obtain the 
clustering shown in Figure 3. We place the distance threshold so an alphabet of 5 residues is formed. 



Strand register prediction. Our algorithm consists of two steps, the prediction of exact strand 
starts and the prediction of strand register. We use two sequence models for these two tasks, namely, a 16- 
residue single strand model and a 9-residue TM strand pair model (Figure 6) . The regional designations 
in the 16-residue canonical strand model are based on known physical attributes of TM /3-strands: an 
average length of 9-10 residues in the headgroup and core regions, and an alternating internal-external 
pattern. The size of each region (the cap regions, headgroup regions, and core region) is determined by 
dividing the total number of residues in a particular region by the number of strands in the dataset. 
We have 4 residues for the extracellular cap region, 3 for the periplasmic cap region, 2 each for the two 
headgroup regions, and 5 for the core region. 

The 9-residue strand pair model is derived from the canonical 16-residue models for two adjacent 
strands. These 9 residues are those designated to be in the transmembrane region, i.e., in the headgroup 
or core regions. We exclude the 7 cap residues because the cap regions do not contribute to strand 
pairing. We also incorporate the physical properties of the 3 types of strand interactions in the model: 
The strong H-bond and non-H-bonded interactions must alternate, and the weak H-bonds must extend 
one residue in the N-C direction. Due to chirality constraints in antiparallel /3-sheets (i.e. all amino acids 
in biological proteins are L-amino acids), the backbone H-bonding pattern is fixed once the N-C direction 
of the strands and the internal-external pattern are determined by step 1. We describe steps 1 and 2 in 
more detail: 

1 Predict exact starts. For a 16-residue window fitted to our strand model (Figure 6a), we calculate 
the single strand energy E(s; i) in kT units for strand i: 

16 

E(s;i) = -£)lnP(fc;t,«), 
fc=i 

where s is the displacement of the window position and P(k; i, s) is the single body propensity for 
the fe-th residue in this window of strand i, where the residues are alternatively assigned as internal 
and external. We calculate the energy score for a total of 11 possible windows: the window starting 
at the given approximated strand start s = (either provided beforehand or calculated by a strand 
predictor), and all windows 5 residues up (s < +5) or down (s > —5) in the amino acid sequence. 
Because there are two possibilities for the internal-external pattern of TM residues in our model, 
we calculated strand energy scores for both possibilities for each window. We take the window with 
the lowest energy as the exact strand start that we use in step 2. The prediction also identifies 
which residues are internal and which are external. 
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Figure 6: Illustrations of the models for the strand register prediction algorithm, a) 16-residue TM /3-strand model. Two 
models are scored for each window with different internal-external designations (hatch marks), b) 9-residue strand pair 
model. The example shown has a strand shear of +1. One strand is shifted up or down against the other (fixed) strand 
to score different strand shears. 



2 Predict strand register. After excluding the 7 cap residues, we fit the 9-residue windows for two 
adjacent strands to the strand pair model (Figure 6b), and calculate the strand pairing energy 
E(s; i, i + 1) in kT units for the strand pair (i, i + 1) with strand shear s as 

E(s;i,i+ 1) = — ^2 \nPsHB(k; + s) - ^ lnPjvs(fc;i,i + 1, s) 

- ^2 \nP W HB{k; i,i + l,s) + a\ — ^ 2 - s\. 

Here i and i + 1 are the labels of the two adjacent strands, P(k;i,i + 1, s) is the propensity of 
pairing between the fc-th residues in strand i and strand i + 1, and the strand shear s is the 
residue displacement between the starts of the two strands. The first 3 terms refer to each of the 
3 interaction types: strong H-bonds (SHB), non-H-bonded interactions (NB), and weak H-bonds 
(WHB). The last term represents a penalty to the score when the strand shear s deviates from the 
average strand shear, approximated as the average shearing number N + 2 of known TM /^-barrels 
divided by the number of strands N. a is a coefficient determined computationally. We find that 
a = 2 works well. 

We calculate the strand pairing score for a total of 11 possible windows by sliding one strand in 
a pair against the other strand: the windows with a strand shear of s = — 4 residues up to a 
strand shear of s — +6. We also exclude strand shears that place internal residues next to external 
residues, as this would violate the H-bonding patterns of /3-sheets. This reduces the search space by 
half. The strand shear s with the lowest strand pairing energy is taken as the true strand register. 

To approximate the strand starts of the /3-barrel membrane proteins in our dataset, we use the hidden 
Markov model predictor of Bigelow et al. [7], one of the most successful predictors for strand starts. This 
predictor uses only the amino acid sequence as input, and outputs a designation for each residue from a 
list of 4 possibilities: "up-strand" (referring to an odd-numbered strand), "down-strand" (even- numbered 
strand), loop between an up- and down-strand, and loop between a down- and up-strand. This predictor 
will therefore also predict the number of strands in the protein ab initio. We use the first instances of 
the up- and down-strand designations for each strand as approximate strand starts. 

We exclude from our prediction analysis two proteins in our dataset (6 strand pairs) for which the 
strand start predictor failed: TolC and a-HL. For TolC, the predictor did not detect any TM strands. For 
a-HL, the predictor detected far too many strands (8 instead of the actual 2 strands). For the remaining 
17 of the 19 proteins, the hidden Markov model correctly predicted 252 of the 256 strands, with only 4 
false positives. The false positive strands were all short (7 residues) and had irregular composition. Our 
register prediction was also incorrect for all strand pairs involving these 4 false positives. We apply our 
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prediction to the 17 /3-barrel membrane proteins in leave-one-out lashion: we calculate single and pairwise 
propensities using 16 of the proteins, and use them to predict strand registers in the 17th protein. Since 
the pairwise propensities are derived from a very small dataset, we introduce a pseudocount of 1 to the 
observed and expected numbers of each pair when calculating propensities. 
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